Abstract

Phylogenetic trees are a central tool in understanding evolution. They are typically 
inferred from sequence data, and capture evolutionary relationships through time. It 
is essential to be able to compare trees from different data sources (e.g. several genes 
from the same organisms) and different inference methods. We propose a new metric for 
robust, quantitative comparison of rooted, labeled trees. It enables clear visualizations of 
tree space, gives meaningful comparisons between trees, and can detect distinct islands 
of tree topologies in posterior distributions of trees. This makes it possible to select
well-supported summary trees. We demonstrate our approach on Dengue fever phylogenies.


1 Introduction 

Phylogenetic trees are fundamental tools for understanding evolution. Improvements in
sequencing technology have meant that phylogenetic analyses are growing in size and scope.
However, when a tree is inferred from data there are multiple sources of uncertainty.
Competing approaches to tree estimation can produce markedly different trees. Trees may conflict
due to signals from selection (e.g. convergent evolution), and/or when derived from different
data (e.g. the organisms’ mitochondrial vs nuclear DNA, individual genes or other subsets
of sequence data). Evolution is not always tree-like: species trees differ from gene trees,
and many organisms exchange genes through horizontal gene transfer. It is therefore crucial
to be able to compare trees to identify these signals.

Trees can be compared by direct visualization, aided by methods such as tanglegrams and
software such as DensiTree [1], but this does not lend itself to detailed comparison of large
groups of trees. Current quantitative methods for tree comparison suffer from the challenges
of visualizing non-Euclidean distances and from counter-intuitive behavior. For example,
the nearest-neighbor interchange (NNI) distance of Robinson and Foulds (RF) [25], which is
the most widely used, is hampered by the fact that large NNI distances do not imply large
changes among the shared ancestry of most tips. In fact, two trees differing in
the placement of a single tip can be a maximal NNI distance apart.

We introduce a metric which flexibly captures both tree structure and branch lengths. 
It can be used as a quantitative tool for comparing phylogenetic trees. Each metric on trees 
defines a tree space; this tree space lends itself to clear visualizations in low dimensions, and
captures and highlights differences in trees according to their biological significance. 

In Section 2 we formally define our distance function, prove that it is a metric, and
explain its capacity to capture tree structure and branch lengths. We also provide a brief
survey, explaining how our metric relates to and differs from existing metrics (Section 2.3).
In Section 3 we explain some of the applications of our metric. We show how our metric
enables visualization of tree space (Section 3.1) and detection of islands (Section 3.2), which
we demonstrate with a simple application to Dengue fever phylogenies. We also explain how
our metric provides a new suite of methods for selecting summary trees in Section 3.3. We
conclude with some ideas for extensions to our metric in Section 4.
2 Metrics


2.1 Our metric: definition and proof 


Let $\mathcal{T}_k$ be the set of all rooted trees on $k$ tips with labels $1, \dots, k$. In common
with previous literature we say that trees $T_a, T_b \in \mathcal{T}_k$ have the same labeled shape or
topology if the set of all tip partitions admitted by internal edges of $T_a$ is identical to that
of $T_b$, and we write this as $T_a \cong T_b$. We say that $T_a = T_b$ if they have the same topology
and each corresponding branch has the same length.

For any tree $T \in \mathcal{T}_k$ let $m_{i,j}$ be the number of edges on the path from the root to the
most recent common ancestor (MRCA) of tips $i$ and $j$, let $M_{i,j}$ be the length of this path,
and let $p_i$ be the length of the pendant edge to tip $i$. Then, including all pairs of tips, we
have two vectors:


$$m(T) = (m_{1,2}, m_{1,3}, \dots, m_{k-1,k}, \underbrace{1, \dots, 1}_{k \text{ times}}),$$

which captures the tree topology, and

$$M(T) = (M_{1,2}, M_{1,3}, \dots, M_{k-1,k}, p_1, \dots, p_k),$$

which captures the topology and the branch lengths. The vector $M(T)$ is similar to the
vector of cophenetic values [29], [5] (Section 2.3). We form a convex combination of these
vectors, parameterized with $\lambda \in [0,1]$, to give

$$v_\lambda(T) = (1 - \lambda)\, m(T) + \lambda M(T).$$

Figure 1 provides an example of this calculation for two small trees.
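The definitions of $m(T)$ and $M(T)$ can be sketched in code. The following is a minimal illustration, not the authors' implementation: the nested-tuple tree encoding and the function name `tree_vectors` are assumptions made here, and the branch lengths of $T_1$ are read off the vectors reported in Figure 1.

```python
from itertools import combinations

def tree_vectors(root, labels):
    """Return (m, M) for a rooted tree written as nested (content, branch_length) pairs.

    Hypothetical encoding: a tip is ("label", length); an internal node is
    ((child, child, ...), length). The length stored on the root is ignored.
    """
    m, M, pendant = {}, {}, {}

    def visit(node, edges_from_root, length_from_root):
        content, _ = node
        if isinstance(content, str):
            return [content]  # a tip contributes only itself
        tips_below = []
        for child in content:
            child_content, child_length = child
            if isinstance(child_content, str):
                pendant[child_content] = child_length
            tips_below.append(
                visit(child, edges_from_root + 1, length_from_root + child_length)
            )
        # Tips taken from two different children have this node as their MRCA.
        for left, right in combinations(tips_below, 2):
            for i in left:
                for j in right:
                    m[frozenset((i, j))] = edges_from_root
                    M[frozenset((i, j))] = length_from_root
        return [tip for tips in tips_below for tip in tips]

    visit(root, 0, 0.0)
    pairs = [frozenset(pair) for pair in combinations(labels, 2)]
    m_vec = [m[p] for p in pairs] + [1] * len(labels)
    M_vec = [M[p] for p in pairs] + [pendant[t] for t in labels]
    return m_vec, M_vec


# T_1 = ((A:1.2, B:0.8):0.5, (C:0.8, D:1):1.1), reconstructed from the Figure 1 vectors.
AB = ((("A", 1.2), ("B", 0.8)), 0.5)
CD = ((("C", 0.8), ("D", 1.0)), 1.1)
T1 = ((AB, CD), 0.0)

m1, M1 = tree_vectors(T1, ["A", "B", "C", "D"])
print(m1)
print(M1)
```

The pair order follows `itertools.combinations`, which matches the column order (A,B), (A,C), (A,D), (B,C), (B,D), (C,D) used in Figure 1.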

[Figure 1: drawings of $T_1$ and $T_2$; $m$ is labeled ‘Topology’ and $M$ is labeled ‘Topology and lengths’]

          (A,B)  (A,C)  (A,D)  (B,C)  (B,D)  (C,D)  p_A  p_B  p_C  p_D
m(T_1)    1      0      0      0      0      1      1    1    1    1
M(T_1)    0.5    0      0      0      0      1.1    1.2  0.8  0.8  1
m(T_2)    2      1      0      1      0      0      1    1    1    1
M(T_2)    1.2    0.9    0      0.9    0      0      0.8  1.4  0.7  1

$$v_\lambda(T_i) = (1 - \lambda)\, m(T_i) + \lambda M(T_i), \qquad d_\lambda(T_1, T_2) = \| v_\lambda(T_1) - v_\lambda(T_2) \|$$

Figure 1: A tree is characterized by the vectors $m$ and $M$, which are calculated as shown.
These are used to calculate the distance between the trees for any $\lambda \in [0,1]$. Here,
$d_0(T_1, T_2) = 2$ and $d_1(T_1, T_2) = 1.96$.
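The two distances quoted in the Figure 1 caption can be checked directly from the tabulated vectors. The sketch below hard-codes the four rows of the figure; the helper names `v` and `d` are chosen here for illustration only.

```python
import math

# Rows of the Figure 1 table: six tip pairs followed by four pendant entries.
m_T1 = [1, 0, 0, 0, 0, 1, 1, 1, 1, 1]
M_T1 = [0.5, 0, 0, 0, 0, 1.1, 1.2, 0.8, 0.8, 1]
m_T2 = [2, 1, 0, 1, 0, 0, 1, 1, 1, 1]
M_T2 = [1.2, 0.9, 0, 0.9, 0, 0, 0.8, 1.4, 0.7, 1]

def v(m, M, lam):
    """Convex combination v_lambda = (1 - lambda) m + lambda M."""
    return [(1 - lam) * a + lam * b for a, b in zip(m, M)]

def d(lam):
    """Euclidean distance between v_lambda(T_1) and v_lambda(T_2)."""
    va, vb = v(m_T1, M_T1, lam), v(m_T2, M_T2, lam)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))

for lam in (0.0, 0.5, 1.0):
    print(lam, round(d(lam), 2))
```

At $\lambda = 0$ only the four entries of $m$ that differ contribute, each by 1, which gives $\sqrt{4} = 2$; at $\lambda = 1$ the squared differences of $M$ sum to 3.85, which gives approximately 1.96.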


A metric is a mathematical notion of distance; specifying a metric gives structure and
shape to a set of objects, forming a space. A function $d(T_1, T_2)$ is a metric if, for all
$T_1, T_2 \in \mathcal{T}_k$,

1. $d(T_1, T_2) \ge 0$ (distances are non-negative)

2. $d(T_1, T_2) = 0 \iff T_1 = T_2$ (the distance is only 0 if they are the same)

3. $d(T_1, T_2) = d(T_2, T_1)$ (distance is symmetric)

4. for any $T_3 \in \mathcal{T}_k$, $d(T_1, T_2) \le d(T_1, T_3) + d(T_3, T_2)$ (the triangle inequality)

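Theorem 1. The function $d_\lambda : \mathcal{T}_k \times \mathcal{T}_k \to \mathbb{R}$ given by

$$d_\lambda(T_a, T_b) = \| v_\lambda(T_a) - v_\lambda(T_b) \|$$

is a metric on $\mathcal{T}_k$, where $\| \cdot \|$ is the Euclidean distance ($l^2$-norm) and $\lambda \in [0,1]$.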

Proof. Since the Euclidean distance between vectors satisfies the conditions (1), (3) and (4)
for being a metric, it remains to prove that $d_0(T_a, T_b) = 0 \iff T_a \cong T_b$ (i.e. the distance
is 0 with $\lambda = 0$ if and only if the trees have the same topology) and
$d_\lambda(T_a, T_b) = 0 \iff T_a = T_b$ for all $\lambda \in (0,1]$ (i.e. the distance is 0 for
$0 < \lambda \le 1$ if and only if the trees are identical). We will
address this in three stages, showing that (1) the tree topology vector, (2) the branch-length
focused vector, and (3) their convex combination each uniquely define a tree. That is, we
show that for $T_a, T_b \in \mathcal{T}_k$,

1. $m(T_a) = m(T_b) \Rightarrow T_a \cong T_b$,

2. $M(T_a) = M(T_b) \Rightarrow T_a = T_b$, and

3. for $\lambda \in (0,1)$, $v_\lambda(T_a) = v_\lambda(T_b) \Rightarrow T_a = T_b$.

For ease of notation we restrict our attention here to binary trees; it is straightforward to
extend these arguments to trees that are not binary.

1. We show that $m(T)$ characterizes a tree topology. Suppose that $T_a, T_b \in \mathcal{T}_k$ have
$d_0(T_a, T_b) = 0$, so $m_{i,j}(T_a) = m_{i,j}(T_b)$ for all pairs $i, j \in 1, \dots, k$. Consider the tip
partition created by the root of $T_a$. That is, if the root and its two descendant edges were removed,
then $T_a$ would be split into two subtrees, whose tip sets we label $L$ and $R$. For all leaf pairs
$(i, j)$ with $i \in L$ and $j \in R$ we have $m_{i,j}(T_a) = 0$, and therefore $m_{i,j}(T_b) = 0$. Thus the
root of $T_b$ also admits the partition $\{L, R\}$.

Similarly, any internal node $n$ in $T_a$ partitions its descendant tips into non-empty sets
$L_n, R_n$. Let the number of edges on the path from the root to $n$ be $x_n$. For all leaf pairs $(i, j)$
with $i \in L_n$, $j \in R_n$ we have $m_{i,j}(T_a) = x_n = m_{i,j}(T_b)$, and so there must also be an internal
node in $T_b$ which partitions the leaves into the sets $L_n, R_n$. Since this is true for all internal
nodes, and hence all internal edges, we have $T_a \cong T_b$, and $d_0$ is a metric on tree topologies.
Note that the final $k$ fixed entries of $m(T)$ are redundant for unique characterization of the
topology of the tree, but are included to allow the convex combination of the topological and
branch-length focused vectors.

2. We show that $M(T)$ characterizes a tree using a similar argument to that of part (1).
Suppose that for $T_a, T_b \in \mathcal{T}_k$ we have $d_1(T_a, T_b) = 0$, so $M_{i,j}(T_a) = M_{i,j}(T_b)$ for all pairs
$i, j \in 1, \dots, k$. Let the length of the path from the root to internal node $n$ be $X_n$. Then for
all $i \in L_n$, $j \in R_n$ we have $M_{i,j}(T_a) = X_n = M_{i,j}(T_b)$, which means that $T_b$ also contains an
internal node at distance $X_n$ from the root which admits the partition $\{L_n, R_n\}$. Since this
holds for all internal nodes including the root (where $X_n = 0$), we have that $T_a$ and $T_b$ have
the same topology and internal branch lengths.

The final $k$ elements of $M(T)$ correspond to the pendant branch lengths. When $M(T_a) =
M(T_b)$ we have that for each $i \in 1, \dots, k$ the pendant branch length to tip $i$ has length $p_i$ in
both $T_a$ and $T_b$. Thus $T_a$ and $T_b$ have the same topology and branch lengths, hence $T_a = T_b$
and $d_1$ is a metric.
Figure 2: If $d_\lambda(T_a, T_b) = 0$ then $T_a$ and $T_b$ must share the same root partition, hence $S_2$ is
the same set of tips in both trees. If $m_{x,y}(T_a) \ne m_{x,y}(T_b)$, with $m_{x,y}(T_a) - m_{x,y}(T_b) = n$ (here
$m_{x,y}(T_a) - m_{x,y}(T_b) = 5 - 2 = 3$), then there exist at least 3 tips $z_1, z_2, z_3$ between the root
and the MRCA of $x$ and $y$ in $T_a$, but positioned further from the root than the MRCA of $x$
and $y$ in $T_b$.

3. Finally, we need to show that $v_\lambda(T)$ characterizes a tree for $\lambda \in (0,1)$. Suppose that for
$T_a, T_b \in \mathcal{T}_k$ and $\lambda \in (0,1)$ we have $d_\lambda(T_a, T_b) = 0$, so $v_\lambda(T_a) = v_\lambda(T_b)$.

Each vector has length $\binom{k}{2} + k$. It is clear that for the final $k$ entries, that is for
$\binom{k}{2} < i \le \binom{k}{2} + k$, we have

$$0 = (1 - \lambda)(1 - 1) + \lambda (M_i(T_a) - M_i(T_b)),$$

which implies that $M_i(T_a) = M_i(T_b)$.

We therefore restrict our attention to the first $\binom{k}{2}$ elements of $v_\lambda$. Now $d_\lambda(T_a, T_b) = 0$
implies that

$$0 = (1 - \lambda)(m_{i,j}(T_a) - m_{i,j}(T_b)) + \lambda (M_{i,j}(T_a) - M_{i,j}(T_b)) \qquad (1)$$

for all $i, j \in 1, \dots, k$. We show that, for any $\lambda \in (0,1)$, although it is possible for Equation (1)
to hold for some $i, j \in 1, \dots, k$, it will only hold for all $i, j \in 1, \dots, k$ when $T_a = T_b$.

Suppose for a contradiction that we have $T_a \ne T_b$ but $d_\lambda(T_a, T_b) = 0$. First, observe
that if $m_{i,j}(T_a) = 0$ then $M_{i,j}(T_a) = 0$, which forces $m_{i,j}(T_b) = M_{i,j}(T_b) = 0$, and so
$d_\lambda(T_a, T_b) = 0$ implies that $T_a$ and $T_b$ must share the same root partition. Now fix $\lambda \in (0,1)$
and consider a pair of tips $x, y \in 1, \dots, k$ with $m_{x,y}(T_a) \ne m_{x,y}(T_b)$ and $m_{x,y}(T_a), m_{x,y}(T_b) \ne 0$,
which must exist since $T_a \not\cong T_b$, using part (1). Without loss of generality, suppose that
$m_{x,y}(T_a) - m_{x,y}(T_b) = n$, where $n \in \mathbb{N}$. Then there exist at least $n$ tips $z_1, \dots, z_n$ for which,
because the trees have the same root partition, we have

$$m_{x,z_i}(T_a) = m_{y,z_i}(T_a) < m_{x,y}(T_a)$$

and

$$m_{x,z_i}(T_b) \ge m_{x,y}(T_b), \qquad m_{y,z_i}(T_b) \ge m_{x,y}(T_b),$$

for each $i \in 1, \dots, n$ (see Figure 2). Pick $z_j$ so that $m_{x,z_j}(T_a) = \min_{i \in [n]} m_{x,z_i}(T_a)$. Then
$m_{x,z_j}(T_a) - m_{x,z_j}(T_b) \le m_{x,y}(T_a) - n - m_{x,y}(T_b) = n - n = 0$. Now since Equation (1) holds
for all $i, j \in 1, \dots, k$, and since $M_{x,z_j}(T_b) \ge M_{x,y}(T_b)$, we have

$$0 \ge m_{x,z_j}(T_a) - m_{x,z_j}(T_b) = \frac{\lambda}{1 - \lambda} \left( M_{x,z_j}(T_b) - M_{x,z_j}(T_a) \right)$$

$$\ge \frac{\lambda}{1 - \lambda} \left( M_{x,y}(T_b) - M_{x,y}(T_a) + M_{x,y}(T_a) - M_{x,z_j}(T_a) \right).$$

But $M_{x,y}(T_b) - M_{x,y}(T_a) = \left( \frac{1 - \lambda}{\lambda} \right) n > 0$ and $M_{x,y}(T_a) - M_{x,z_j}(T_a) > 0$, so we have a
contradiction. Thus Equation (1) cannot hold for all $i, j \in 1, \dots, k$, so
$d_\lambda(T_a, T_b) = 0 \Rightarrow T_a = T_b$. □

Our metric is fundamentally for rooted trees. A single unrooted tree, when rooted in two 
different places, produces two distinct rooted trees, and our distance between these will be 
positive. It will be large if the two distinct places chosen for the roots are separated by a 
long path in the original unrooted tree. However, it would be straightforward to check if two 
trees have the same (unrooted) topology in our metric: root both trees on the edge to the 
same tip and find the distance. Re-rooting a tree will induce systematic changes in v(T)^ 
with some entries increasing and others decreasing by the same amount. The metric dx is 
invariant under permutation of labels. That is, for trees Ta and T}) and a label permutation 
cr, dx{Ta,Tb) = dx{T^,T^). 
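The invariance under label permutation can be illustrated numerically. In this sketch the helper `relabel` and the example permutation are assumptions introduced here; the vectors are the $M$ rows of Figure 1. Renaming the tips of both trees in the same way permutes the coordinates of both vectors identically, so the Euclidean distance is unchanged.

```python
import math
from itertools import combinations

LABELS = ["A", "B", "C", "D"]
PAIRS = [frozenset(p) for p in combinations(LABELS, 2)]

def relabel(vec, sigma):
    """Vector of the tree obtained by renaming every tip t to sigma[t]."""
    k = len(PAIRS)
    by_pair = {frozenset(sigma[t] for t in p): x for p, x in zip(PAIRS, vec[:k])}
    by_tip = {sigma[t]: x for t, x in zip(LABELS, vec[k:])}
    return [by_pair[p] for p in PAIRS] + [by_tip[t] for t in LABELS]

def dist(u, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, w)))

# M vectors of T_1 and T_2 from Figure 1 (the lambda = 1 case).
M_T1 = [0.5, 0, 0, 0, 0, 1.1, 1.2, 0.8, 0.8, 1]
M_T2 = [1.2, 0.9, 0, 0.9, 0, 0, 0.8, 1.4, 0.7, 1]

sigma = {"A": "B", "B": "C", "C": "D", "D": "A"}  # an arbitrary example permutation

before = dist(M_T1, M_T2)
after = dist(relabel(M_T1, sigma), relabel(M_T2, sigma))
print(before, after)
```

Because `relabel` only reorders coordinates, the same check applies to $m$ and to $v_\lambda$ for any $\lambda$.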

We note that alternative, similar definitions for a metric on $\mathcal{T}_k$ are possible. In particular,
the metric defined by

$$D_\lambda(T_a, T_b) = (1 - \lambda) \| m(T_a) - m(T_b) \| + \lambda \| M(T_a) - M(T_b) \|$$

gives similar behavior to the metric we have used. The difference between the two is that in
$D_\lambda$ the Euclidean distances are taken between the $m$ and $M$ vectors before they are weighted
by $\lambda$. Rather than a Euclidean distance between two vectors ($v_\lambda$ for each tree), $D_\lambda$ is a
weighted sum of two different metrics: the distance between $m(T_a)$ and $m(T_b)$ (first term in
the above), and between $M(T_a)$ and $M(T_b)$ (second term). A benefit of $D_\lambda$ is that it is linear
in $\lambda$, so that the changes as $\lambda$ moves from 0 to 1 are more intuitive. A disadvantage is that
$D_\lambda$ itself is not Euclidean, leading to (typically only slightly) poorer-quality visualization in
MDS plots (Section 3.1).
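The relationship between the two definitions can be seen on the Figure 1 trees. This is a small sketch with helper names chosen here (`d_lam`, `D_lam`): the two functions agree at $\lambda = 0$ and $\lambda = 1$, $D_\lambda$ interpolates linearly between those endpoints, and by the triangle inequality for the Euclidean norm $d_\lambda \le D_\lambda$ in between.

```python
import math

m_T1 = [1, 0, 0, 0, 0, 1, 1, 1, 1, 1]
M_T1 = [0.5, 0, 0, 0, 0, 1.1, 1.2, 0.8, 0.8, 1]
m_T2 = [2, 1, 0, 1, 0, 0, 1, 1, 1, 1]
M_T2 = [1.2, 0.9, 0, 0.9, 0, 0, 0.8, 1.4, 0.7, 1]

def norm_diff(u, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, w)))

def d_lam(lam):
    """Distance between the combined vectors v_lambda (the metric of Theorem 1)."""
    va = [(1 - lam) * a + lam * b for a, b in zip(m_T1, M_T1)]
    vb = [(1 - lam) * a + lam * b for a, b in zip(m_T2, M_T2)]
    return norm_diff(va, vb)

def D_lam(lam):
    """Weighted sum of the two separate distances (the alternative metric)."""
    return (1 - lam) * norm_diff(m_T1, m_T2) + lam * norm_diff(M_T1, M_T2)

for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(lam, round(d_lam(lam), 4), round(D_lam(lam), 4))
```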

2.2 The role of $\lambda$

The parameter $\lambda$ allows the user to choose to what extent the branch lengths of a tree, vs
its topology alone, contribute to the tree distance. The distance between two trees may
increase or decrease as $\lambda$ increases from 0 to 1. Since the topology-based vector, $m$, contains
the number of edges along paths in the tree, and $M$ contains the path lengths, the branch
lengths are implicitly compared to 1 in the convex combination $v_\lambda$. In other words, if the
branch lengths are much larger than 1, then the entries of $M$ will be much larger than the
corresponding entries of $m$, and $M$ will dominate in the expression for $v_\lambda$ even when $\lambda$ is
relatively small. Conversely, if the branch lengths are much less than 1, the entries of $M$ will
be much less than those of $m$, and a value of $\lambda$ near 1 will be required in order for lengths
to substantially change $v_\lambda$. In the case when all branch lengths are equal to 1, $m = M$ and
the distance is independent of $\lambda$. The example in Figure 3 may provide some intuition.

In order to capture length-sensitive distances between trees, we may wish to use a value
of $\lambda$ such that neither $(1 - \lambda) m$ nor $\lambda M$ dominate excessively, but naturally this will depend
on the analysis. For a more gradual change in $d_\lambda$ as $\lambda$ tends to 1, and for comparison of this
change across different data sets, it is possible to rescale the branch lengths, for example by
dividing all branch lengths by the median, or by changing the units. However, this should
be done with caution because information is inevitably lost through rescaling. For example,
if a phylogenetic analysis of multiple genes from the same organism had produced trees with
similar topologies but different clock rates (e.g. branches in trees from gene 1 were typically
twice as long as branches in trees from gene 2), this information would be obscured by
rescaling.

$$v_\lambda(T_a) = (1 - \lambda)(1, 0, 0, 1, 1, 1) + \lambda (1, 0, 0, 1, 1, 1)$$
$$v_\lambda(T_b) = (1 - \lambda)(1, 0, 0, 1, 1, 1) + \lambda (2, 0, 0, 1, 1, 1)$$
$$v_\lambda(T_c) = (1 - \lambda)(0, 1, 0, 1, 1, 1) + \lambda (0, 1, 0, 1, 1, 1)$$
$$v_\lambda(T_d) = (1 - \lambda)(0, 1, 0, 1, 1, 1) + \lambda (0, 0.5, 0, 1, 1, 1)$$

[Figure 3: drawings of $T_a$, $T_b$, $T_c$, $T_d$ and a table of pairwise distances; the entries that can be recovered are $d_\lambda(T_a, T_b) = \lambda$ and $d_\lambda(T_b, T_c) = \sqrt{\lambda^2 + 2\lambda + 2}$]

Figure 3: Example trees from $\mathcal{T}_3$ to illustrate the effect of changing $\lambda$. The distance between
$T_a$ and $T_c$ ($d_\lambda(T_a, T_c)$) is fixed for $\lambda \in [0,1]$ because their unmatched edges have the same
length. $d_\lambda(T_b, T_d) < d_\lambda(T_b, T_c)$ for $\lambda \in (0,1]$ because the edge which $T_c$ and $T_d$ share and
which is not found in $T_b$ is shorter in $T_d$ than in $T_c$. Most entries increase with $\lambda$. The only
distance to decrease as $\lambda \to 1$ is $d_\lambda(T_a, T_d)$, because the difference between the lengths of
their unmatched branches is less than one.
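The behavior described for the four trees of Figure 3 can be reproduced from their vectors. The sketch below copies the $(m, M)$ pairs from the figure; the dictionary layout and function name `d` are choices made here for illustration.

```python
import math
from itertools import combinations

# (m, M) pairs for the four three-tip trees of Figure 3.
TREES = {
    "Ta": ((1, 0, 0, 1, 1, 1), (1, 0, 0, 1, 1, 1)),
    "Tb": ((1, 0, 0, 1, 1, 1), (2, 0, 0, 1, 1, 1)),
    "Tc": ((0, 1, 0, 1, 1, 1), (0, 1, 0, 1, 1, 1)),
    "Td": ((0, 1, 0, 1, 1, 1), (0, 0.5, 0, 1, 1, 1)),
}

def d(a, b, lam):
    """d_lambda between two named trees."""
    (ma, Ma), (mb, Mb) = TREES[a], TREES[b]
    va = [(1 - lam) * x + lam * y for x, y in zip(ma, Ma)]
    vb = [(1 - lam) * x + lam * y for x, y in zip(mb, Mb)]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))

for lam in (0.0, 0.5, 1.0):
    row = {f"{a}-{b}": round(d(a, b, lam), 3) for a, b in combinations(TREES, 2)}
    print(lam, row)
```

Working through the vectors by hand gives $d_\lambda(T_a, T_c) = \sqrt{2}$ for every $\lambda$, $d_\lambda(T_a, T_b) = \lambda$, and $d_\lambda(T_a, T_d) = \sqrt{1 + (1 - \lambda/2)^2}$, which is the only one of the six distances that decreases as $\lambda$ grows.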

2.3 Other metrics on labeled phylogenetic trees 

Various metrics have been defined on phylogenetic trees. For a recent comparative survey, 

see [TT] . 

The vector $M(T)$ is similar to the cophenetic vector of Cardona et al. [5], following Sokal
and Rohlf [29], where $M_{i,j}$ is called the cophenetic value of tips $i$ and $j$. Parts (1) and (2)
of our proof follow directly from results in [5]. Instead of the pendant branch lengths $p_i$,
Cardona et al. use the depth of each taxon, which can be considered as $M_{i,i}$. This involves
a repetition of information between $M_{i,i}$, $M_{j,j}$ and $M_{i,j}$ whenever $M_{i,j} > 0$. However, their
definition does allow for the presence of nested taxa (taxa which are internal nodes of the
tree). Cardona et al. also note that tree vectors such as these can be compared by any
norm $L^p$, but that the Euclidean norm $L^2$, which we also use, has the benefits of being
more discriminative than larger values of $p$, and enabling many geometrical and clustering
methods.

The most widely used metric is that of Robinson-Foulds (RF) [25]. However, RF and
its branch-length weighted version [24] are fundamentally very different from our metric
because they are defined on unrooted trees, whereas our metric emphasizes the placement of
the root and all the descendant MRCAs. Similarly, the path difference metrics of Williams
and Clifford [33] and Steel and Penny are for unrooted trees. They compare the
distance between each pair of tips in a tree; in essence, they consider the distance between
tips and their MRCA, whereas our metric considers the distance between the root and the
MRCA. These metrics therefore capture different characteristics of trees and are only loosely
correlated with our metric.

The metric introduced by Billera, Holmes and Vogtmann (BHV) captures branch lengths
as well as tree structure [3] on rooted trees. The BHV tree space is formed by mathematically
‘gluing’ together orthants. Each orthant corresponds to a tree topology and moving within
an orthant corresponds to changing the tree’s branch lengths. Moving from one orthant to
an adjacent one corresponds to a nearest-neighbor interchange move. The metric is convex:
for any two distinct trees $T_1$ and $T_2$, there is a tree $T_3$ ‘in between’ them, i.e. such that
$d(T_1, T_2) = d(T_1, T_3) + d(T_3, T_2)$. This is a mathematically appealing and
useful property, in part because it allows averaging of trees [1]. However, it does not allow
the user to choose a balance between the topology of the tree and the branch lengths. We
provide further comparisons in Figure 4.

Our metric compares trees with the same set of taxa (i.e. the same tips). As a
consequence, it is suited for studies in which there is one set of taxa, and trees can be compared
from different genes, inference methods, and sources of data. Our metric does not capture
distances between trees with different taxa; where the taxa overlap between two trees, our
approach can compare the subtrees restricted to the taxa present in both trees. In contrast,
comparisons between unlabeled trees take a different form (e.g. kernel methods [22]), suitable
to comparing trees on different sets of taxa.

Many phylogenetic analyses are, implicitly or explicitly, conducted in the context of
a rooted tree. In the context of macroevolution, examples include estimates of times to
divergence, ancestral relationships and ancestral character reconstruction. In more recent
literature, most methods to link pathogen phylogenies to epidemic dynamics
(phylodynamics) are based on rooted phylogenetic trees. For these reasons, the fact that the
relationships to the root of the tree play a central role in our metric allows it to capture
intuitive similarities in groups of trees in a way that other metrics do not.


3 Exploring tree space 

Tree spaces are large and complex. It is important to understand the ‘shape’ of a tree space
before attempting to summarize it. Our metric creates a space which can be effectively
visualized (Section 3.1) and where islands (distinct clusters) of tree topologies can be detected.
We demonstrate these techniques on a sample dataset of BEAST posterior trees for Dengue
fever. Finally, in Section 3.3 we describe how our metric can be used to make a principled
selection of summary trees.


3.1 Visualizing tree space 


Visualization techniques like multidimensional scaling (MDS) [6] have been used to explore
tree space previously, but are challenged by poor-quality projections. When a set
of distances is projected into a low-dimensional picture, there is typically some loss of
information, which may result in a poor-quality visualization. For example, if 10 points are
all 3 units away from each other, this will not project well into two dimensions; some will
appear more closely grouped than others. However, if there are only 3 such points they can
be arranged on a triangle, capturing the distances in two dimensions.

One approach to checking the quality of a visualization is a Shepard plot [28], which is 
a scatter plot of the true distance vs the MDS distance (i.e. the distance in the projection). 
Figure 4 shows the MDS plot of the space of trees on 6 tips (with unit branch lengths) under
our metric and two others: RF [25] and BHV [3]. Shepard plots are included as an indication 
of the quality of each projection. 
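The data behind a Shepard plot is simply a list of (true distance, projected distance) pairs. The sketch below is not MDS itself: as a stated simplification it flattens a regular tetrahedron (four points that are all 3 units apart) onto a plane by dropping one coordinate, which is enough to show the kind of distortion a Shepard plot records. The point coordinates and variable names are chosen here for illustration.

```python
import math
from itertools import combinations

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pairwise(points):
    """Distances between all pairs of points, in a fixed pair order."""
    return [dist(p, q) for p, q in combinations(points, 2)]

# Four points that are mutually 3 units apart: the corners of a regular tetrahedron.
s = 3 / (2 * math.sqrt(2))
tetrahedron = [(s, s, s), (s, -s, -s), (-s, s, -s), (-s, -s, s)]

# A crude two-dimensional picture: drop the third coordinate.
flattened = [(x, y) for x, y, _ in tetrahedron]

# Each pair is one point of the Shepard plot: (true distance, projected distance).
shepard = list(zip(pairwise(tetrahedron), pairwise(flattened)))
for true_d, proj_d in shepard:
    print(round(true_d, 3), round(proj_d, 3))
```

In this example four of the six pairs appear closer together than they really are, while none appears further apart, so all points of the Shepard plot lie on or below the diagonal.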

Each metric captures differences in both shape (shown by color) and labeling. Our
approach produces a wide range of tree distances and captures intuitive similarities (e.g.
the similar chimp-human pairing in the yellow and gray triangles in Figure 4a). All 945
possible tree shapes and permutations of their labels are present in the input set of trees,
and consequently there is no asymmetry that should lead to one group being separated
from the rest. Our metric captures the symmetry in the space and illustrates this in the
MDS projection (Figure 4a), whereas in RF and BHV (Figures 4b and 4c), poor-quality
projections lead to apparent distinct tree islands where none exist. This makes detecting
genuine islands in posterior sets of trees difficult using RF or BHV. The Euclidean nature
of our metric means that it is well-suited to visualizations that project distances into two-
or three-dimensional Euclidean space. The Shepard plots illustrate that the correspondence
between the projected distances and true distances is better in our metric than the others,
though the projection distance can be much smaller than the true distance (but not the
converse). MDS projections are of higher quality for trees from data than in the space of all
trees on 6 tips (e.g. Figure 5).

[Figure 4 panels: (a) our metric, (b) RF, (c) BHV]

Figure 4: MDS projections of the shape of $\mathcal{T}_6$ according to metrics as shown, with
corresponding Shepard plots. Colors correspond to tree shapes, of which examples are shown
with triangles. Symmetries correspond to permutations of the labels. In order to include the
BHV metric in this comparison we assigned all branch lengths to be 1, with the result that
$m = M$ and our metric is invariant to $\lambda \in [0,1]$.


























3.2 Islands in tree space 


Tree inference methods explore the set of possible trees given the data, but there are many
alternative trees. Bayesian Markov Chain Monte Carlo (MCMC) methods as implemented
in BEAST [9] and MrBayes [15] produce a posterior set of trees, each with associated
likelihoods. Distinct islands of trees within small NNI distance can share a high parsimony or
likelihood. Complicating matters further, not all taxa in a dataset will have
complete data at all loci. In this case, there are ‘terraces’ of many equally likely trees, with trees
in a terrace all supporting the same subtrees for the taxa with data at a given locus.
These facts have deep implications for tree inference and analysis, but the difficulty of
detecting and interpreting tree islands has meant that the majority of analyses, particularly
on large datasets, remain based on a single summary tree method such as the maximum
clade credibility (MCC) tree with posterior support values illustrating uncertainty, or on
maximum likelihood or parsimony trees with bootstrap supports. Our metric can detect
distinct clusters or islands of close tree topologies ($\lambda = 0$) within a collection of trees. Since
distance is defined by the metric that is used, these are different from previously described
tree islands [21], [26].
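One simple way to make the notion of an island concrete is to link any two trees whose distance falls below a threshold and take the connected components of the resulting graph. The text does not state how clusters are identified, so the rule, the threshold and the function name `islands` below are all assumptions made for illustration; the input vectors are toy stand-ins for $v_0(T)$.

```python
import math

def dist(u, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, w)))

def islands(vectors, threshold):
    """Group indices into connected components, linking pairs closer than threshold.

    Hypothetical clustering rule, used here only to illustrate the idea of islands.
    """
    unvisited = set(range(len(vectors)))
    groups = []
    while unvisited:
        stack = [unvisited.pop()]
        group = []
        while stack:
            i = stack.pop()
            group.append(i)
            near = {j for j in unvisited if dist(vectors[i], vectors[j]) < threshold}
            unvisited -= near
            stack.extend(near)
        groups.append(sorted(group))
    return sorted(groups)

# Toy vectors standing in for tree vectors at lambda = 0.
toy = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
print(islands(toy, threshold=1.5))
```

With the toy data the first three vectors form one group and the last two another; a real analysis would need a threshold, or a different clustering method, suited to the data.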

We demonstrate our approach using the examples from the original paper introducing
BEAST [8], where Drummond and Rambaut demonstrated their Bayesian analysis on 17
dengue virus serotype 4 sequences under varying priors for model and clock rate.
As a means of comparing posterior tree distributions under different BEAST settings, we ran
the xml files provided in [8] through BEAST v1.8 and analyzed the resulting trees. In Figure 5
we show MDS plots of two of these analyses: Figure 5a is a sample of the posterior
under the standard GTR + Γ + I substitution model with uncorrelated lognormal-distributed
relaxed molecular clock; Figure 5b is a sample from the posterior under the codon-position
specific substitution model GTR + CP, with a strict clock. These analyses demonstrate
some of the different signals which can be detected by visualizing the metric’s tree distances:
distinct islands are visible in (a), whereas in (b) there are some tight bunches of points but
the posterior is not as clearly separated into distinct islands. Additionally, trees in (b) are
more tightly grouped together, indicating that there is less conflict in the phylogenetic signals in
(b). We ran BEAST twice with the settings from (a) (using different random starting seeds),
and found that the space of trees explored and accepted in each run was similar, with the
same islands. It is also encouraging that the MCC tree from the first BEAST run had the
same topology as that from the second run, and that this topology sits in the largest island
(yellow triangle in Figure 5a). Similarly, the MCC tree is in the largest cluster in (b).

Islands are of concern for tree inference and for outcomes that require the topology of
the tree, which will affect ancestral character reconstruction and consequently the interpretation
of many phylogenetic datasets [32]. However, other analyses, and tree estimation methods
themselves, take trees’ branch lengths as well as topology into account. We find that islands
typically merge together in the metric as $\lambda$ approaches 1; the posterior becomes unimodal.


3.3 Summary trees 

Summarizing groups of phylogenetic trees is challenging, particularly when there are different
alternative and inconsistent topologies. MCC trees can summarize posterior
distributions; they rely on including the clades with the strongest posterior support but where these
are not concordant the resulting MCC trees can have negative branch lengths. Furthermore,
the MCC tree itself may never have been sampled by the MCMC chain, casting doubt on
its ability to reflect the relationships in the data.

[Figure 5 panels: (a) ΓI relaxed clock, $\lambda = 0$; (b) CP strict clock, $\lambda = 0$]

Figure 5: MDS plots of dengue fever trees sampled from posteriors demonstrate differences in
the space of trees explored by BEAST under different settings. MCC trees are marked by
yellow triangles. (a) GTR + Γ + I substitution model with uncorrelated lognormal-distributed
relaxed molecular clock. (b) Codon-position specific substitution model GTR + CP, with a
strict clock.

Our metric allows us to find ‘central’ trees within any group of trees: a posterior set of
trees, or any island or cluster of trees. To do this, we exploit the fact that our metric is
simply the Euclidean distance between the two vectors $v_\lambda(T_a)$ and $v_\lambda(T_b)$. Among $N$ trees
$T_i$ ($i = 1, \dots, N$) in a posterior sample, we can find the tree closest to the average vector
$\bar{v} = \frac{1}{N} \sum_{i=1}^{N} v_\lambda(T_i)$. The average vector $\bar{v}$ may not in itself represent a tree, but we can
then find the tree vectors from our sample which are closest to this average. These vectors
correspond to trees, $T_c$, (not necessarily unique) which minimize the distance between $\bar{v}$ and
$v_\lambda(T_c)$. This minimal distance is a measure of the quality of the summary: if it is small, $T_c$
is close to ‘average’ in the posterior. $T_c$ is known as the geometric median tree [10]. The
geometric median is one of a range of barycentric methods which can be used with our metric
to select a tree as a representative of a group. It is also straightforward to weight trees by
likelihood or other characteristics when finding the geometric median. This provides a suite
of tools for summarizing collections of trees. Geometric median trees will always have been
sampled by the MCMC, and will not have negative branch lengths. We found that within
islands, geometric median trees are very close to the MCC tree for the island.
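The selection of a central tree described in this section can be sketched as follows. The function name `central_trees` and the optional `weights` argument are introduced here for illustration, and the input vectors are toy values rather than real tree vectors; the procedure is the one stated in the text: average the vectors, then return the sampled vectors closest to that average.

```python
import math

def dist(u, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, w)))

def central_trees(vectors, weights=None):
    """Indices of the sampled vectors closest to the (weighted) average vector.

    Returns (indices, minimal_distance); ties are all reported.
    """
    weights = weights or [1.0] * len(vectors)
    total = sum(weights)
    mean = [
        sum(w * vec[k] for w, vec in zip(weights, vectors)) / total
        for k in range(len(vectors[0]))
    ]
    dists = [dist(vec, mean) for vec in vectors]
    best = min(dists)
    return [i for i, x in enumerate(dists) if math.isclose(x, best)], best

# Toy vectors standing in for v_lambda(T_i) of four sampled trees.
sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 4.0)]

print(central_trees(sample))                         # unweighted
print(central_trees(sample, weights=[1, 1, 1, 0]))   # last tree given zero weight
```

In the unweighted toy case two vectors are equally close to the average, which illustrates why the central tree need not be unique. Passing weights (for example, hypothetical likelihood-based weights) shifts the average and can change which tree is selected.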
4 Concluding remarks


The fact that our metric is a Euclidean distance between two vectors whose components have
an intuitive description means that simple extensions are straightforward to imagine and to
compute. For example, it may be the case that the placement of a particular tip is a key
question. This could occur, for example, in a real-time analysis of an outbreak, where new
cases need to be placed on an existing phylogeny to determine the likely source of infection.
We could form a metric that emphasizes differences in the placement of a particular tip (say,
A), by weighting A’s entries of $m$ and $M$ highly compared to all other entries. In this new
metric, trees would appear similar if their placement of A was similar; patterns of ancestry
among the other tips would contribute less to the distance. Indeed, it is possible to design
numerous metrics, extending this one and others, and using linear combinations of existing
metrics.
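A tip-weighted variant of the distance might look like the sketch below. The weighting scheme (multiplying every entry that involves the chosen tip by a constant) and the names `tip_weights` and `weighted_distance` are assumptions made here; the text only describes the idea of weighting one tip's entries more heavily. The vectors are the $m$ rows of Figure 1.

```python
import math
from itertools import combinations

LABELS = ["A", "B", "C", "D"]

def tip_weights(tip, weight):
    """Weight vector: `weight` on every entry involving `tip`, 1 elsewhere."""
    pair_w = [weight if tip in pair else 1.0 for pair in combinations(LABELS, 2)]
    tip_w = [weight if t == tip else 1.0 for t in LABELS]
    return pair_w + tip_w

def weighted_distance(va, vb, w):
    """Euclidean distance after scaling each coordinate difference by its weight."""
    return math.sqrt(sum((wi * (a - b)) ** 2 for wi, a, b in zip(w, va, vb)))

# Topology vectors m(T_1) and m(T_2) from Figure 1.
m_T1 = [1, 0, 0, 0, 0, 1, 1, 1, 1, 1]
m_T2 = [2, 1, 0, 1, 0, 0, 1, 1, 1, 1]

plain = weighted_distance(m_T1, m_T2, tip_weights("D", 1.0))
focus_D = weighted_distance(m_T1, m_T2, tip_weights("D", 10.0))
print(plain, focus_D)
```

With all weights equal to 1 this reduces to $d_0$. Raising the weight on tip D makes the single differing entry that involves D, the pair (C,D), dominate the distance.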

Our metric enables quantitative comparison of trees. It is relevant to viral, bacterial
and higher organisms and can help to reveal distinct, likely patterns of evolution. It allows
quantitative comparison of tree estimation methods and can provide a heuristic for
convergence of tree estimates. There are also many applications in comparing trees derived from
different data. For example, the metric can be used to detect informative sites which, when
removed from sequence alignments, change the phylogeny substantially. More generally, our
metric can find distances between any rooted, labeled trees with the same set of tips. It can
be used to compare tree structures from a variety of scientific disciplines, including decision
trees, network spanning trees, hierarchical clustering trees and language trees.